from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
Zach Dickson
Feel free to copy this notebook and use it for yourself. If you do not already have Python installed on your computer (how to check) and you don’t want to install it, then you can use this notebook in Google Colab.
This option can be especially appealing if you want to use the language model on lots of text, because Google Colab allows you to use a GPU, which speeds up the model significantly. I’ll demonstrate how to do this at the end as well.
##### If you're using Google Colab, you'll need to change the file location of the dataset to the appropriate location in order to read it in
#### There are a few ways to do this -- one would be to mount your own Google Drive. Another would be to upload the newspaper file in the left-hand column of the Colab notebook.
import pandas as pd # import necessary library
import numpy as np # import necessary library
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
pio.renderers.default='colab'
pd.set_option('display.max_columns', 50) ## set max columns to display as 50
df = pd.read_csv('./newspaper_data_all.csv') ## read in the dataset. See the first comment in this box of code
df.date = pd.to_datetime(df.date) # set date column to pandas version of date
df.newspaper = df.newspaper.str.replace('_', ' ').str.title() ## clean up the newspaper names for presentation purposes
df.country = df.country.str.title().str.replace('Uk','UK') ## clean up the country names for presentation purposes
First five rows of the dataset:
| | date | link | title | newspaper | country |
|---|---|---|---|---|---|
| 0 | 2020-01-01 | https://sport.fakt.pl/inne-sporty/david-stern-... | Zmarł były komisarz ligi NBA David Stern | Fakt | Poland |
| 1 | 2020-01-01 | https://www.fakt.pl/polityka/sylwester-polityk... | "Grzeczny" sylwester polityków. Tylko Jaki się... | Fakt | Poland |
| 2 | 2020-01-01 | https://www.fakt.pl/wydarzenia/polska/zaglada-... | Zagłada ptaków w Polsce. W środę zabijały je w... | Fakt | Poland |
| 3 | 2020-01-01 | https://sport.fakt.pl/pilka-nozna/nietypowa-ci... | Nietypowa cieszynka piłkarzy Cracovii. Przebra... | Fakt | Poland |
| 4 | 2020-01-01 | https://www.fakt.pl/plotki/z-kim-ania-z-rolnik... | Z kim Ania z „Rolnik szuka żony” spędziła sylw... | Fakt | Poland |
Articles per newspaper:
newspaper
Abc Spain 50666
Bild 204491
De Telegraaf 176781
De Welt 42199
El Mundo 39738
El Pais 67419
Fakt 89627
Gazeta Wyborcza 27390
Guardian 149828
Nrc 49361
Rzeczpospolita 20647
Suddeutsche Zeitung 586495
Uk Sun 128994
Uk Times 32062
Volkskrant 62579
dtype: int64
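The counts above, and the per-country breakdown just below, come from pandas groupby-size calls. The original cells aren't shown in this export, but the operation looks like this on a toy frame (the toy values are illustrative, not from the dataset):

```python
import pandas as pd

# Toy frame standing in for the newspaper dataset
toy = pd.DataFrame({'newspaper': ['Bild', 'Bild', 'Fakt'],
                    'country': ['Germany', 'Germany', 'Poland']})

per_paper = toy.groupby('newspaper').size()                        # articles per newspaper
per_country_paper = toy.groupby(['country', 'newspaper']).size()   # articles per newspaper, per country
print(per_paper.to_dict())  # {'Bild': 2, 'Fakt': 1}
```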
[Bar chart titled 'Number of Newspapers', showing articles per newspaper]
Articles per newspaper, per country
country newspaper
Germany Bild 204491
De Welt 42199
Suddeutsche Zeitung 586495
Netherlands De Telegraaf 176781
Nrc 49361
Volkskrant 62579
Poland Fakt 89627
Gazeta Wyborcza 27390
Rzeczpospolita 20647
Spain Abc Spain 50666
El Mundo 39738
El Pais 67419
UK Guardian 149828
Uk Sun 128994
Uk Times 32062
dtype: int64
[Bar chart of articles per newspaper, grouped by country, with the newspaper names on the x-axis]
## example code
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline,TFAutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("z-dickson/multilingual_sentiment_newspaper_headlines")
model = TFAutoModelForSequenceClassification.from_pretrained("z-dickson/multilingual_sentiment_newspaper_headlines")
sentiment_classifier = TextClassificationPipeline(tokenizer=tokenizer, model=model, device=0) ## if you're using colab, change the runtime type and add 'device=0' in the parentheses to use a GPU
sentiment_classifier('text we want to get the sentiment for') ## classifies the text we want to get the sentiment for
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline,TFAutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("z-dickson/multilingual_sentiment_newspaper_headlines")
model = TFAutoModelForSequenceClassification.from_pretrained("z-dickson/multilingual_sentiment_newspaper_headlines")
sentiment_classifier = TextClassificationPipeline(tokenizer=tokenizer, model=model, device=0) ## if you're using colab, change the runtime type and add 'device=0' in the parentheses to use a GPU
Some layers from the model checkpoint at z-dickson/multilingual_sentiment_newspaper_headlines were not used when initializing TFBertForSequenceClassification: ['dropout_75']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at z-dickson/multilingual_sentiment_newspaper_headlines.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.
In German, the headline is as follows:
The result is as follows:
negative
The result is nearly identical:
[{'label': 'negative', 'score': 0.998826801776886}]
Next, I’ll work through some examples that you might want to use in an analysis.
Here’s the dataset:
| | date | link | title | newspaper | country |
|---|---|---|---|---|---|
| 1649067 | 2021-11-03 | https://www.bild.de/sport/mehr-sport/baseball/... | Baseball: Atlanta Braves gewinnen World Series... | Bild | Germany |
| 1125040 | 2021-01-21 | https://www.sueddeutsche.de/sport/basketball-t... | Theis-Gala reicht Celtics nicht zum Sieg über ... | Suddeutsche Zeitung | Germany |
| 1345511 | 2021-12-23 | https://www.sueddeutsche.de/politik/italien-pa... | Fürsorgliches Italien | Suddeutsche Zeitung | Germany |
| 784752 | 2021-05-06 | /economia/2021-05-06/peaje-en-las-autovias-fon... | Peajes en las autovías, fondo para los ERTE y ... | El Pais | Spain |
| 646337 | 2022-06-25 | https://www.telegraaf.nl/nieuws/1431146744/lim... | Limburgse pastoor ’die seksfilmpje toonde’ nie... | De Telegraaf | Netherlands |
| 1030380 | 2020-08-23 | https://www.sueddeutsche.de/muenchen/starnberg... | Bermuda-Dreieck in der Mittelkonsole | Suddeutsche Zeitung | Germany |
| 556442 | 2021-02-12 | https://www.telegraaf.nl/sport/1236508470/dubb... | Dubbel schaatsgoud voor Oranje: ’We moesten wa... | De Telegraaf | Netherlands |
| 1155656 | 2021-03-01 | https://www.sueddeutsche.de/muenchen/erding/er... | Online-Diskussion zur Seenotrettung | Suddeutsche Zeitung | Germany |
| 290975 | 2022-06-27 | https://www.theguardian.com/world/2022/jun/27/... | Nato to put 300,000 troops on high alert in re... | Guardian | UK |
| 1259656 | 2021-08-21 | https://www.sueddeutsche.de/panorama/kriminali... | Vandalen verwüsten Kinder- und Jugendzentrum i... | Suddeutsche Zeitung | Germany |
The function is named sentiment_analysis.
def sentiment_analysis(dataset, date_start=None, date_end=None):
    x = dataset.copy()
    try:
        if date_start is not None:
            x.date = pd.to_datetime(x.date)
            x = x.loc[(x.date <= f'{date_end}') & (x.date >= f'{date_start}')]
        x['sentiment'] = sentiment_classifier(x.title.to_list()) ## classify every headline at once
        x['sentiment_label'] = x.sentiment.apply(lambda s: s['label']) ## extract the label from each result
        return x
    except TypeError:
        print("ERROR: ensure date_start and date_end are in the format Year-Month-Day. For example: 2020-01-08")
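To see how the function is called without downloading the model, here's a self-contained toy run. The stub classifier below stands in for the real Hugging Face pipeline (it always returns neutral), and the try/except is dropped for brevity; only the call shape matters here:

```python
import pandas as pd

# Stub standing in for the Hugging Face pipeline, so this sketch runs without the model
def sentiment_classifier(titles):
    return [{'label': 'neutral', 'score': 0.9} for _ in titles]

def sentiment_analysis(dataset, date_start=None, date_end=None):
    x = dataset.copy()
    if date_start is not None:
        x.date = pd.to_datetime(x.date)
        x = x.loc[(x.date >= date_start) & (x.date <= date_end)]
    x['sentiment'] = sentiment_classifier(x.title.to_list())
    x['sentiment_label'] = x.sentiment.apply(lambda s: s['label'])
    return x

toy = pd.DataFrame({'date': ['2020-01-08', '2021-06-01'],
                    'title': ['Headline one', 'Headline two']})
out = sentiment_analysis(toy, date_start='2020-01-01', date_end='2020-12-31')
print(out.sentiment_label.tolist())  # ['neutral'] -- only the 2020 headline survives the date filter
```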
| | date | link | title | newspaper | country | sentiment | sentiment_label |
|---|---|---|---|---|---|---|---|
| 1649067 | 2021-11-03 | https://www.bild.de/sport/mehr-sport/baseball/... | Baseball: Atlanta Braves gewinnen World Series... | Bild | Germany | {'label': 'positive', 'score': 0.6868038177490... | positive |
| 1125040 | 2021-01-21 | https://www.sueddeutsche.de/sport/basketball-t... | Theis-Gala reicht Celtics nicht zum Sieg über ... | Suddeutsche Zeitung | Germany | {'label': 'neutral', 'score': 0.7599111199378967} | neutral |
| 1345511 | 2021-12-23 | https://www.sueddeutsche.de/politik/italien-pa... | Fürsorgliches Italien | Suddeutsche Zeitung | Germany | {'label': 'neutral', 'score': 0.8194842338562012} | neutral |
| 784752 | 2021-05-06 | /economia/2021-05-06/peaje-en-las-autovias-fon... | Peajes en las autovías, fondo para los ERTE y ... | El Pais | Spain | {'label': 'neutral', 'score': 0.6340370178222656} | neutral |
| 646337 | 2022-06-25 | https://www.telegraaf.nl/nieuws/1431146744/lim... | Limburgse pastoor ’die seksfilmpje toonde’ nie... | De Telegraaf | Netherlands | {'label': 'neutral', 'score': 0.8677178025245667} | neutral |
| 1030380 | 2020-08-23 | https://www.sueddeutsche.de/muenchen/starnberg... | Bermuda-Dreieck in der Mittelkonsole | Suddeutsche Zeitung | Germany | {'label': 'neutral', 'score': 0.9753379821777344} | neutral |
| 556442 | 2021-02-12 | https://www.telegraaf.nl/sport/1236508470/dubb... | Dubbel schaatsgoud voor Oranje: ’We moesten wa... | De Telegraaf | Netherlands | {'label': 'neutral', 'score': 0.976171612739563} | neutral |
| 1155656 | 2021-03-01 | https://www.sueddeutsche.de/muenchen/erding/er... | Online-Diskussion zur Seenotrettung | Suddeutsche Zeitung | Germany | {'label': 'positive', 'score': 0.568717896938324} | positive |
| 290975 | 2022-06-27 | https://www.theguardian.com/world/2022/jun/27/... | Nato to put 300,000 troops on high alert in re... | Guardian | UK | {'label': 'negative', 'score': 0.9975405931472... | negative |
| 1259656 | 2021-08-21 | https://www.sueddeutsche.de/panorama/kriminali... | Vandalen verwüsten Kinder- und Jugendzentrum i... | Suddeutsche Zeitung | Germany | {'label': 'negative', 'score': 0.9977375268936... | negative |
- dataset – the dataset that you are passing into the function. This will likely just be the entire newspaper dataset
- keyword – the keyword you want to use to identify the titles. In our example, the keyword is borders
- country – this parameter narrows down the results to a specific country
- newspaper – this parameter narrows down the results to a specific newspaper

#### Note: the values that you pass into the function must match the options in the dataset verbatim. The country and newspaper options are presented below for the dataset that we’ve been working with throughout this notebook
Countries:
Newspapers:
array(['Fakt', 'Rzeczpospolita', 'Gazeta Wyborcza', 'Uk Times',
'Guardian', 'Uk Sun', 'Nrc', 'De Telegraaf', 'Volkskrant',
'El Mundo', 'El Pais', 'Abc Spain', 'Suddeutsche Zeitung',
'De Welt', 'Bild'], dtype=object)
### I wrote this function so you don't have to! A function takes in different parameters and outputs the data we want. Don't worry about this code, just run the cell and skip to the next line
def get_titles_by_keyword(dataset, keywords, country=None, newspaper=None, date_start=None, date_end=None):
    ndf = pd.DataFrame()
    x = dataset.copy()
    if newspaper is not None:
        x = x[x.newspaper == newspaper]
    if country is not None:
        x = x[x.country == country]
    if date_start is not None:
        x.date = pd.to_datetime(x.date)
        x = x.loc[(x.date <= f'{date_end}') & (x.date >= f'{date_start}')]
    if isinstance(keywords, str):
        keywords = [keywords]
    for keyword in keywords:
        xx = x[x.title.str.contains(keyword, case=False)] ## case-insensitive match within the title
        ndf = pd.concat([ndf, xx])
    if len(ndf) > 0:
        return ndf
    else:
        print('Your query did not return anything. Make sure that the country, newspaper, and dates are correct. Did you give a start date that is later than the end date?')
        print(' --- ')
        print('The optional newspapers are: ' + str(df.newspaper.unique()))
The function is named get_titles_by_keyword.
We can call the function with the keyword borders using the following:
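The original call isn't preserved in this export; judging from the results below (all from Uk Times), it presumably passed newspaper='Uk Times' along with the keyword, though that argument is an assumption on my part. Under the hood, the match is a case-insensitive substring search on the title, as in this toy sketch:

```python
import pandas as pd

# Toy frame, not the real dataset -- just to show the matching behaviour
toy = pd.DataFrame({
    'title': ['Borders reopen', 'Football results', 'Border crisis deepens'],
    'newspaper': ['Uk Times', 'Uk Times', 'Uk Times'],
})
hits = toy[toy.title.str.contains('border', case=False)]  # matches 'Borders' and 'Border'
print(len(hits))  # 2
```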
The result will appear as follows:
| | date | link | title | newspaper | country |
|---|---|---|---|---|---|
| 110970 | 2020-03-16 | https://www.thetimes.co.uk/article/greece-accu... | Greece accuses Turkey of sending fake migrants... | Uk Times | UK |
| 111030 | 2020-03-23 | https://www.thetimes.co.uk/article/canadas-ind... | Coronavirus: Native Americans close their bord... | Uk Times | UK |
| 111066 | 2020-03-26 | https://www.thetimes.co.uk/article/quest-ends-... | Quest ends at border after 2 years and 14,000 ... | Uk Times | UK |
| 111458 | 2020-05-10 | https://www.thetimes.co.uk/article/chinese-and... | Chinese and Indian troops injured in border br... | Uk Times | UK |
| 111776 | 2020-06-16 | https://www.thetimes.co.uk/article/indian-troo... | 20 Indian troops die in border brawl with Chin... | Uk Times | UK |
| ... | ... | ... | ... | ... | ... |
| 140289 | 2021-11-12 | https://www.thetimes.co.uk/article/french-allo... | Polish border crisis will fuel Channel migrant... | Uk Times | UK |
| 140316 | 2021-11-15 | https://www.thetimes.co.uk/article/britain-and... | France and Germany join Britain’s pledge to st... | Uk Times | UK |
| 140878 | 2022-01-18 | https://www.thetimes.co.uk/article/satellite-i... | Tonga volcano relief efforts hampered as borde... | Uk Times | UK |
| 141119 | 2022-02-15 | https://www.thetimes.co.uk/article/joe-biden-d... | Joe Biden doubts Russia is withdrawing from Uk... | Uk Times | UK |
| 141349 | 2022-03-13 | https://www.thetimes.co.uk/article/russian-air... | Russian airstrikes hit Ukraine military base n... | Uk Times | UK |
105 rows × 5 columns
Which results in the following:
| | date | link | title | newspaper | country |
|---|---|---|---|---|---|
| 110554 | 2020-01-30 | https://www.thetimes.co.uk/article/longest-smu... | Longest smuggling tunnel found under US-Mexico... | Uk Times | UK |
| 110645 | 2020-02-09 | https://www.thetimes.co.uk/article/syria-refug... | Syria refugees: ‘Borders won’t stop us if Assa... | Uk Times | UK |
| 110653 | 2020-02-10 | https://www.thetimes.co.uk/article/trump-cuts-... | Trump cuts aid to fund military and border wal... | Uk Times | UK |
| 110702 | 2020-02-15 | https://www.thetimes.co.uk/article/crash-landi... | Crash Landing On You: Cross-border romantic co... | Uk Times | UK |
| 110787 | 2020-02-25 | https://www.thetimes.co.uk/article/angela-merk... | Contender to succeed Angela Merkel vows to tak... | Uk Times | UK |
| ... | ... | ... | ... | ... | ... |
| 419625 | 2016-08-09 | https://www.thesun.co.uk/news/1581942/ukraine-... | WAR TENSIONS Ukraine warns Russian invasion po... | Uk Sun | UK |
| 419963 | 2016-07-27 | https://www.thesun.co.uk/news/1509975/north-ko... | SNAKE-SPONSORED TERRORISM North Korea accuses ... | Uk Sun | UK |
| 420187 | 2016-07-17 | https://www.thesun.co.uk/news/1457645/gunman-s... | armenia opposition siege Gunmen storm police H... | Uk Sun | UK |
| 420506 | 2016-06-30 | https://www.thesun.co.uk/news/1370262/vladimir... | WORLD WAR THREE FEARS Vladimir Putin threatens... | Uk Sun | UK |
| 420679 | 2016-06-22 | https://www.thesun.co.uk/news/1321708/eu-has-s... | MIGRANT SHAMBLES EU 'has surrendered complete ... | Uk Sun | UK |
1199 rows × 5 columns
Which results in the following:
| | date | link | title | newspaper | country |
|---|---|---|---|---|---|
| 1497255 | 2020-01-05 | https://www.bild.de/regional/leipzig/leipzig-n... | Ab März Pflicht: 20 000 Kinder noch ohne Maser... | Bild | Germany |
| 1506852 | 2020-02-13 | https://www.bild.de/regional/bremen/bremen-akt... | Windpocken-Alarm! Schul-Verbot für Kinder ohn... | Bild | Germany |
| 1511670 | 2020-03-05 | https://www.bild.de/bild-plus/ratgeber/2020/ra... | Impfungen, die Sie nicht kennen – aber brauchen! | Bild | Germany |
| 1515570 | 2020-03-23 | https://www.bild.de/ratgeber/2020/ratgeber/kan... | Pneumokokken-Impfung: Was bringt sie bei Corona? | Bild | Germany |
| 1519010 | 2020-04-07 | https://www.bild.de/unterhaltung/leute/leute/w... | Wegen Verunglimpfung: Droht dem „Überläufer“ e... | Bild | Germany |
| ... | ... | ... | ... | ... | ... |
| 1695518 | 2022-06-05 | https://www.bild.de/ratgeber/gesundheit/gesund... | Schmerzhafte Krankheit: Wer braucht eine Impfu... | Bild | Germany |
| 1696249 | 2022-06-09 | https://www.bild.de/politik/2022/politik/affen... | Affenpocken: Stiko empfiehlt Impfung für Risik... | Bild | Germany |
| 1698781 | 2022-06-21 | https://www.bild.de/bild-plus/ratgeber/2022/ra... | Omikron-Subvariante BA.5: Brauche ich jetzt do... | Bild | Germany |
| 1699254 | 2022-06-23 | https://www.bild.de/regional/stuttgart/stuttga... | 800 000 Spritzen pro Woche - Land bereitet Meg... | Bild | Germany |
| 1700691 | 2022-06-30 | https://www.bild.de/bild-plus/ratgeber/2022/ra... | Experten in BILD: Wie es mit den Corona-Impfun... | Bild | Germany |
872 rows × 5 columns
Say I want to use multiple keywords related to “borders”. A list of related keywords might look like the following:
We can use the same function by passing in a list of keywords. A list in python takes the following form:
You can pass in as many keywords as you want, but they need to be separated by a comma and in square brackets
NOTE: here’s a short tutorial on lists in case you get stuck.
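For instance, the border-related keywords used in the next cell can be stored in a list like this (the variable name is just illustrative):

```python
# A Python list: comma-separated values inside square brackets
border_keywords = ['border', 'bounds', 'confine', 'boundary', 'perimeter']
print(len(border_keywords))  # 5
```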
#example
get_titles_by_keyword(dataset = df, keywords = ['border','bounds','confine','boundary','perimeter'])
Which returns the following:
| | date | link | title | newspaper | country |
|---|---|---|---|---|---|
| 110554 | 2020-01-30 | https://www.thetimes.co.uk/article/longest-smu... | Longest smuggling tunnel found under US-Mexico... | Uk Times | UK |
| 110645 | 2020-02-09 | https://www.thetimes.co.uk/article/syria-refug... | Syria refugees: ‘Borders won’t stop us if Assa... | Uk Times | UK |
| 110653 | 2020-02-10 | https://www.thetimes.co.uk/article/trump-cuts-... | Trump cuts aid to fund military and border wal... | Uk Times | UK |
| 110702 | 2020-02-15 | https://www.thetimes.co.uk/article/crash-landi... | Crash Landing On You: Cross-border romantic co... | Uk Times | UK |
| 110787 | 2020-02-25 | https://www.thetimes.co.uk/article/angela-merk... | Contender to succeed Angela Merkel vows to tak... | Uk Times | UK |
| ... | ... | ... | ... | ... | ... |
| 326780 | 2016-09-12 | https://www.thesun.co.uk/news/1767339/jeremy-c... | corbyn & gone Jez and George Osborne join Labo... | Uk Sun | UK |
| 327193 | 2016-08-08 | https://www.thesun.co.uk/news/1574143/tories-c... | MAY'S MASSIVE MAJORITY Tories could win '90-se... | Uk Sun | UK |
| 344090 | 2021-08-29 | https://www.thesun.co.uk/news/16001089/homeown... | ExclusiveGRAVE INSULT Homeowner 'desecrated wa... | Uk Sun | UK |
| 411786 | 2017-03-06 | https://www.thesun.co.uk/news/3019037/arrested... | CROSSING THE BOUNDARY Perv arrested after 'sho... | Uk Sun | UK |
| 323734 | 2017-03-29 | https://www.thesun.co.uk/news/3205230/two-majo... | SECURITY REVIEW Major reviews of Parliament se... | Uk Sun | UK |
1331 rows × 5 columns
There are different numbers of newspaper titles across countries and newspapers. This poses a challenge for comparisons. One way to get around this is to measure “attention” to a given topic as a proportion of attention to all topics. For example, if the Sun has 5 headlines about immigration, we can get the Sun’s “attention to immigration” by dividing 5 by the total number of Sun headlines in a given period. If the Sun has 100 articles for the day, and 5 are about immigration, then the Sun’s attention to immigration is .05 or 5%. This allows us to compare the relative emphasis a single newspaper devotes to a single topic (or multiple topics). This measurement of attention follows Baumgartner & Jones. I’ve also written about the same method using political texts and computational methods here.
\[Attention_{i,t} = \frac{\sum headlines_{i,t}}{\sum headlines_{t}}\]
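As a worked toy example of the formula (illustrative data, not the real dataset): each month below has two headlines, one of which mentions borders, so attention is 0.5 in both months.

```python
import pandas as pd

toy = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-05', '2020-01-20', '2020-02-10', '2020-02-11']),
    'title': ['Border wall plans', 'Sports roundup', 'Border talks stall', 'Weather update'],
})
toy['month'] = toy.date.dt.to_period('M')
total = toy.groupby('month').size()                                                # all headlines per month
issue = toy[toy.title.str.contains('border', case=False)].groupby('month').size()  # issue headlines per month
attention = (issue / total).fillna(0)
print(attention.tolist())  # [0.5, 0.5]
```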
- dataset – can be the entire newspaper dataset
- keywords – can be a single keyword or a list of keywords as shown above
- newspaper – name of the newspaper of interest
- frequency – takes the values W (weekly), M (monthly), and Y (yearly) and is the interval at which attention is calculated
- topic – the name of the topic that pertains to the keywords (this can be anything, but the function will return a column with this name for the attention values)
#newspapers:
['Fakt', 'Rzeczpospolita', 'Gazeta Wyborcza', 'Uk Times',
'Guardian', 'Uk Sun', 'Nrc', 'De Telegraaf', 'Volkskrant',
'El Mundo', 'El Pais', 'Abc Spain', 'Suddeutsche Zeitung',
'De Welt', 'Bild']
### I wrote this function so you don't have to! A function takes in different parameters and outputs the data we want. Don't worry about this code, just run the cell and skip to the next line
## get proportion of titles that contain keywords out of total number of titles
def get_proportion_by_keyword(dataset, keywords, newspaper, frequency, topic, date_start=None, date_end=None):
    x = dataset.copy()
    x.date = pd.to_datetime(x.date)
    x = x.query('date >= "2020-01-01" and date <= "2022-05-31"') ## restrict to the study period
    if newspaper is not None:
        x = x.loc[x.newspaper == newspaper]
    if date_start is not None:
        x = x.loc[(x.date <= f'{date_end}') & (x.date >= f'{date_start}')]
    try:
        ## total headlines per interval
        x = pd.DataFrame(x.groupby(pd.Grouper(key='date', freq=frequency)).size()).reset_index().rename(columns={0:'total_headlines'})
        ## headlines matching the keywords, per interval
        titles_by_keyword = get_titles_by_keyword(dataset, keywords, newspaper=newspaper)
        titles_by_keyword = pd.DataFrame(titles_by_keyword.groupby(pd.Grouper(key='date', freq=frequency)).size()).reset_index().rename(columns={0:'issue_headlines'})
        x = x.merge(titles_by_keyword, on='date', how='left')
        x.issue_headlines = x.issue_headlines.fillna(0)
        x['attention'] = x.issue_headlines/x.total_headlines
        x['topic'] = topic
        x['newspaper'] = newspaper
        return x
    except AttributeError:
        print('No headlines matched the keywords; check the keywords and newspaper name.')
| | date | total_headlines | issue_headlines | attention | topic | newspaper |
|---|---|---|---|---|---|---|
| 0 | 2020-01-31 | 1110 | 3 | 0.002703 | immigration | Uk Times |
| 1 | 2020-02-29 | 1062 | 10 | 0.009416 | immigration | Uk Times |
| 2 | 2020-03-31 | 1144 | 13 | 0.011364 | immigration | Uk Times |
| 3 | 2020-04-30 | 1066 | 5 | 0.004690 | immigration | Uk Times |
| 4 | 2020-05-31 | 1077 | 9 | 0.008357 | immigration | Uk Times |
| 5 | 2020-06-30 | 1069 | 1 | 0.000935 | immigration | Uk Times |
| 6 | 2020-07-31 | 1082 | 3 | 0.002773 | immigration | Uk Times |
| 7 | 2020-08-31 | 1138 | 13 | 0.011424 | immigration | Uk Times |
| 8 | 2020-09-30 | 1091 | 9 | 0.008249 | immigration | Uk Times |
| 9 | 2020-10-31 | 1079 | 9 | 0.008341 | immigration | Uk Times |
| 10 | 2020-11-30 | 1047 | 4 | 0.003820 | immigration | Uk Times |
| 11 | 2020-12-31 | 1074 | 2 | 0.001862 | immigration | Uk Times |
| 12 | 2021-01-31 | 34 | 5 | 0.147059 | immigration | Uk Times |
The function is named get_proportion_by_keyword. Here’s an example using the keyword “border”, the UK Sun, and a monthly frequency; we’ll name the topic “Borders”. Here’s the code to do the above:
x = get_proportion_by_keyword(df, 'border', 'Uk Sun', 'M', 'Borders')
Which returns the following:
| | date | total_headlines | issue_headlines | attention | topic | newspaper |
|---|---|---|---|---|---|---|
| 0 | 2020-01-31 | 1095 | 2 | 0.001826 | Borders | Uk Sun |
| 1 | 2020-02-29 | 872 | 3 | 0.003440 | Borders | Uk Sun |
| 2 | 2020-03-31 | 1101 | 11 | 0.009991 | Borders | Uk Sun |
| 3 | 2020-04-30 | 1240 | 1 | 0.000806 | Borders | Uk Sun |
| 4 | 2020-05-31 | 1155 | 4 | 0.003463 | Borders | Uk Sun |
| 5 | 2020-06-30 | 1094 | 14 | 0.012797 | Borders | Uk Sun |
| 6 | 2020-07-31 | 1203 | 6 | 0.004988 | Borders | Uk Sun |
| 7 | 2020-08-31 | 2260 | 6 | 0.002655 | Borders | Uk Sun |
| 8 | 2020-09-30 | 2419 | 8 | 0.003307 | Borders | Uk Sun |
| 9 | 2020-10-31 | 2295 | 6 | 0.002614 | Borders | Uk Sun |
| 10 | 2020-11-30 | 2377 | 8 | 0.003366 | Borders | Uk Sun |
| 11 | 2020-12-31 | 2301 | 15 | 0.006519 | Borders | Uk Sun |
| 12 | 2021-01-31 | 2384 | 21 | 0.008809 | Borders | Uk Sun |
| 13 | 2021-02-28 | 2372 | 10 | 0.004216 | Borders | Uk Sun |
| 14 | 2021-03-31 | 2741 | 9 | 0.003283 | Borders | Uk Sun |
| 15 | 2021-04-30 | 2596 | 14 | 0.005393 | Borders | Uk Sun |
| 16 | 2021-05-31 | 2328 | 10 | 0.004296 | Borders | Uk Sun |
| 17 | 2021-06-30 | 2318 | 0 | 0.000000 | Borders | Uk Sun |
| 18 | 2021-07-31 | 2264 | 5 | 0.002208 | Borders | Uk Sun |
| 19 | 2021-08-31 | 2026 | 4 | 0.001974 | Borders | Uk Sun |
| 20 | 2021-09-30 | 2035 | 6 | 0.002948 | Borders | Uk Sun |
| 21 | 2021-10-31 | 1972 | 2 | 0.001014 | Borders | Uk Sun |
| 22 | 2021-11-30 | 1852 | 7 | 0.003780 | Borders | Uk Sun |
| 23 | 2021-12-31 | 1694 | 6 | 0.003542 | Borders | Uk Sun |
| 24 | 2022-01-31 | 1815 | 7 | 0.003857 | Borders | Uk Sun |
| 25 | 2022-02-28 | 1659 | 16 | 0.009644 | Borders | Uk Sun |
| 26 | 2022-03-31 | 1780 | 6 | 0.003371 | Borders | Uk Sun |
| 27 | 2022-04-30 | 1425 | 5 | 0.003509 | Borders | Uk Sun |
| 28 | 2022-05-31 | 1506 | 6 | 0.003984 | Borders | Uk Sun |
x = get_proportion_by_keyword(df, 'border', 'Uk Sun', 'M', 'Borders')
pio.renderers.default='colab'
#sns.relplot(x='date', y='attention', hue='topic', data=x, kind='line', aspect=2) ## same plot in seaborn
fig = px.line(x, x="date", y="attention", color='topic',markers=True, title='Attention to borders by UK Sun', height=500)
fig.update_layout(template='plotly_white')
fig.update_layout(
font_family="Courier New",
font_color="black",
title_font_family="Courier New",
title_font_color="black",
font_size = 18,
legend_title_font_color="black",
template='plotly_white',
showlegend=True,
xaxis_title = 'Date',
width=1200)
fig.show(include_plotlyjs=True)
The same plot at a weekly frequency:
x = get_proportion_by_keyword(df, 'border', 'Uk Sun', 'W', 'Borders')
#sns.relplot(x='date', y='attention', hue='topic', data=x, kind='line', aspect=2)
fig = px.line(x, x="date", y="attention", color='topic',markers=True, title='Attention to borders by UK Sun', height=500)
fig.update_layout(template='plotly_white')
fig.update_layout(
font_family="Courier New",
font_color="black",
title_font_family="Courier New",
title_font_color="black",
font_size = 18,
legend_title_font_color="black",
template='plotly_white',
showlegend=True,
xaxis_title = 'Date',
width=1200)
fig.show(include_plotlyjs=True)
First I need to identify the keywords related to vaccines. I’ll use a few for each language and store them as lists:
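The original keyword cells aren't preserved in this export. A German list might look like the following; the stems are my assumption, chosen because the matching is a case-insensitive substring search:

```python
## Hypothetical example -- the original keyword lists aren't shown in the notebook output.
## Because the search is case-insensitive, the stem 'impf' already matches
## Impfung, impfen, Impfpflicht, geimpft, and so on.
german_keywords = ['impf', 'vakzin']
print(len(german_keywords))  # 2
```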
Then I can create new datasets for all the newspapers in Germany. I’ll do this by using the get_titles_by_keyword function and pass in the country so that it returns all the German newspapers:
which provides the following:
| | date | link | title | newspaper | country |
|---|---|---|---|---|---|
| 867760 | 2022-12-29 | https://www.sueddeutsche.de/bayern/bayern-coro... | Bilanz : Bayerns Impfzentren schließen zum Jah... | Suddeutsche Zeitung | Germany |
| 867804 | 2022-12-08 | https://www.sueddeutsche.de/gesundheit/china-c... | Sars-CoV-2 : Heftige Infektionswelle in China ... | Suddeutsche Zeitung | Germany |
| 867873 | 2022-11-22 | https://www.sueddeutsche.de/politik/impfpflich... | Corona-Pandemie : Pfleger und Ärzte brauchen k... | Suddeutsche Zeitung | Germany |
| 867881 | 2022-11-17 | https://www.sueddeutsche.de/gesundheit/kinder-... | Stiko-Empfehlung : Gesunde Kleinkinder brauche... | Suddeutsche Zeitung | Germany |
| 867885 | 2022-11-16 | https://www.sueddeutsche.de/gesundheit/guertel... | Nebenwirkungen der Covid-Impfung? : Verdachtsf... | Suddeutsche Zeitung | Germany |
| ... | ... | ... | ... | ... | ... |
| 1684082 | 2022-04-13 | https://www.bild.de/bild-plus/ratgeber/psychol... | Psychische Form von Gewalt: Schweigen ist grau... | Bild | Germany |
| 1693094 | 2022-05-24 | https://www.bild.de/ratgeber/2022/ratgeber/cov... | Covid19-Empfehlung der Stiko - Gesunde Kinder ... | Bild | Germany |
| 1696690 | 2022-06-11 | https://www.bild.de/bild-plus/ratgeber/kind-fa... | Kinder: Was Sie nach dem Schimpfen immer tun s... | Bild | Germany |
| 1698376 | 2022-06-19 | https://www.bild.de/ratgeber/2022/ratgeber/usa... | USA impfen Kleinste gegen Corona - Corona-Piks... | Bild | Germany |
| 1699624 | 2022-06-25 | https://www.bild.de/regional/leipzig/leipzig-n... | Impfen: Neuer Stichtags-Ärger bei Kliniken und... | Bild | Germany |
5218 rows × 5 columns
I’ll store that dataset of German headlines about vaccines in a new variable called germany_vaccines.
#example
germany_vaccines = get_titles_by_keyword(dataset = df, keywords = german_keywords, country = 'Germany')
We’ll then get the sentiment of each article in the new dataset using the sentiment_analysis function from above, storing the result in another variable called sentiment_germany_vaccines:
This will take a few minutes, because we are using the language model to classify each headline. In the code below, I take a random sample of 1,000 of the 5,000+ vaccine headlines in Germany so it doesn’t take too long.
Here’s what we’ll get if we view that new dataset:
germany_vaccines = get_titles_by_keyword(dataset = df, keywords = german_keywords, country = 'Germany')
germany_vaccines = germany_vaccines.sample(1000) ## we'll only use a sample of 1000 so it doesn't take a long time to classify all the headlines
sentiment_germany_vaccines = sentiment_analysis(germany_vaccines)
| date | link | title | newspaper | country | sentiment | sentiment_label | |
|---|---|---|---|---|---|---|---|
| 1246095 | 2021-07-30 | https://www.sueddeutsche.de/muenchen/fuerstenf... | Impfen zeigt deutliche Wirkung | Suddeutsche Zeitung | Germany | {'label': 'neutral', 'score': 0.5086002349853516} | neutral |
| 1310215 | 2021-10-26 | https://www.sueddeutsche.de/gesundheit/gesundh... | USA-Reisen ab November möglich nach Impfung | Suddeutsche Zeitung | Germany | {'label': 'neutral', 'score': 0.5342729687690735} | neutral |
| 1332218 | 2021-12-02 | https://www.sueddeutsche.de/muenchen/fuerstenf... | Die Impfung wirkt | Suddeutsche Zeitung | Germany | {'label': 'positive', 'score': 0.6801331639289... | positive |
| 1315560 | 2021-11-01 | https://www.sueddeutsche.de/gesundheit/gesundh... | Auffrischungsimpfungen in den meisten Heimen z... | Suddeutsche Zeitung | Germany | {'label': 'neutral', 'score': 0.7856541275978088} | neutral |
| 1335496 | 2021-11-29 | https://www.sueddeutsche.de/gesundheit/gesundh... | Polizei startet mit Corona-Auffrischungsimpfungen | Suddeutsche Zeitung | Germany | {'label': 'negative', 'score': 0.8802963495254... | negative |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1577247 | 2020-12-18 | https://www.bild.de/bild-plus/ratgeber/2020/ra... | Fragen zum Impfplan - Bin ich dick genug für e... | Bild | Germany | {'label': 'neutral', 'score': 0.9369633793830872} | neutral |
| 1583904 | 2021-01-15 | https://www.bild.de/regional/muenchen/muenchen... | Gesundheitsminister Holetschek - „Das Impfen i... | Bild | Germany | {'label': 'positive', 'score': 0.5605814456939... | positive |
| 1148468 | 2021-02-25 | https://www.sueddeutsche.de/gesundheit/gesundh... | Sieben-Tage-Inzidenz sinkt nur langsam: 168.00... | Suddeutsche Zeitung | Germany | {'label': 'negative', 'score': 0.993061363697052} | negative |
| 1251331 | 2021-08-06 | https://www.sueddeutsche.de/gesundheit/kranken... | Corona-Auffrischimpfung: Krankenhäuser bereite... | Suddeutsche Zeitung | Germany | {'label': 'positive', 'score': 0.7920066118240... | positive |
| 1586770 | 2021-01-27 | https://www.bild.de/news/inland/news-inland/di... | Diese Rentner wollen endlich ihre Corona-Impfu... | Bild | Germany | {'label': 'positive', 'score': 0.7067196369171... | positive |
1000 rows × 7 columns
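As the table shows, the raw pipeline output lands in the sentiment column as a dict of label and score. The sentiment_label column can be derived from it with a one-liner; a sketch, assuming dicts shaped like the ones above:

```python
import pandas as pd

# hypothetical rows shaped like the sentiment column shown above
out = pd.DataFrame({'sentiment': [
    {'label': 'neutral', 'score': 0.51},
    {'label': 'positive', 'score': 0.68},
]})
# pull the label string out of each dict into its own column
out['sentiment_label'] = out['sentiment'].apply(lambda d: d['label'])
```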
If we want to see the totals of how many positive, negative, and neutral headlines about vaccines each newspaper ran:
| newspaper | sentiment_label | count | |
|---|---|---|---|
| 0 | Bild | negative | 95 |
| 1 | Bild | neutral | 85 |
| 2 | Bild | positive | 59 |
| 3 | De Welt | negative | 32 |
| 4 | De Welt | neutral | 33 |
| 5 | De Welt | positive | 8 |
| 6 | Suddeutsche Zeitung | negative | 189 |
| 7 | Suddeutsche Zeitung | neutral | 281 |
| 8 | Suddeutsche Zeitung | positive | 218 |
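The totals above are a standard grouped count. A sketch of the pandas call that would produce them, here naming the count column explicitly and using a hypothetical miniature of sentiment_germany_vaccines:

```python
import pandas as pd

# hypothetical miniature of sentiment_germany_vaccines
sdf = pd.DataFrame({
    'newspaper': ['Bild', 'Bild', 'De Welt'],
    'sentiment_label': ['negative', 'neutral', 'negative'],
})
# count rows per (newspaper, sentiment_label) pair
counts = sdf.groupby(['newspaper', 'sentiment_label']).size().reset_index(name='count')
```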
For that, we just need to pass the german_keywords to the get_proportion_by_keyword function. Because the function requires a newspaper name, we’ll create three separate datasets, one for each of the German newspapers:
Remember, the German newspapers are: Suddeutsche Zeitung, De Welt, and Bild
attention_SZ = get_proportion_by_keyword(df, keywords = german_keywords, newspaper = 'Suddeutsche Zeitung', frequency = 'M', topic = 'Vaccines')
attention_DW = get_proportion_by_keyword(df, keywords = german_keywords, newspaper = 'De Welt', frequency = 'M', topic = 'Vaccines')
attention_Bild = get_proportion_by_keyword(df, keywords = german_keywords, newspaper = 'Bild', frequency = 'M', topic = 'Vaccines')
Now we can plot the attention to vaccines by any of the newspapers:
attention_SZ = get_proportion_by_keyword(df, keywords = german_keywords, newspaper = 'Suddeutsche Zeitung', frequency = 'M', topic = 'Vaccines')
attention_DW = get_proportion_by_keyword(df, keywords = german_keywords, newspaper = 'De Welt', frequency = 'M', topic = 'Vaccines')
attention_Bild = get_proportion_by_keyword(df, keywords = german_keywords, newspaper = 'Bild', frequency = 'M', topic = 'Vaccines')
#sns.relplot(x='date', y='attention', hue='newspaper', data=attention_SZ, kind='line', aspect=2)
fig = px.line(attention_SZ, x="date", y="attention", color='newspaper',markers=True, title='Attention', height=500)
fig.update_layout(template='plotly_white')
fig.update_layout(
font_family="Courier New",
font_color="black",
title_font_family="Courier New",
title_font_color="black",
font_size = 18,
legend_title_font_color="black",
template='plotly_white',
showlegend=True,
xaxis_title = 'Date',
width=1200)
fig.show(include_plotlyjs=True)
We can combine the datasets and then re-create the same plot:
# combine dataset:
german_newspapers_combined = pd.concat([attention_SZ, attention_DW, attention_Bild])
# reset index
german_newspapers_combined.reset_index(inplace=True)
# recreate plot:
sns.relplot(x='date', y='attention', hue='newspaper', data = german_newspapers_combined, kind='line', aspect=2)
# add title (optional)
plt.title('Attention to Vaccines in German Newspapers')
which gives the following result:
german_newspapers_combined = pd.concat([attention_SZ, attention_DW, attention_Bild])
## reset index
german_newspapers_combined.reset_index(inplace=True)
#recreate plot:
fig = px.line(german_newspapers_combined, x="date", y="attention", color='newspaper',markers=True, title='Attention to Vaccines in German Newspapers', height=500)
fig.update_layout(
font_family="Courier New",
font_color="black",
title_font_family="Courier New",
title_font_color="black",
font_size = 18,
legend_title_font_color="black",
template='plotly_white',)
fig.show(include_plotlyjs=True)
german = ['grenz','schengen','Reisebeschränkung', 'Reiseverbot', 'Einreiseverbot', 'Mobilitätsbeschränkung', 'Schlagbaum']
polish = ['granica', 'Schengen', 'ograniczenie', 'zakaz', 'podróży', 'Zakazy', 'mobilności', 'szlaban']
spanish = ['front','schengen','restriccion','prohibición', 'prohibiciones','movilidad']
dutch = ['grens','schengen','reisbeperking', 'reisverbod', 'inreisverbod', 'mobilititeitsbeperking', 'slagboom']
english = ['border','schengen','restriction','prohibition', 'prohibitions','mobility','travel ban','entry ban']
dic = {'Poland': 'polish', 'Germany': 'german', 'Spain': 'spanish', 'Netherlands': 'dutch', 'UK': 'english'}
df['language'] = df['country'].map(dic)
languages = ['german', 'polish', 'spanish', 'dutch', 'english']
attn= pd.DataFrame()
for language in languages:
for newspaper in df.loc[df.language == language].newspaper.unique():
try:
attention = get_proportion_by_keyword(df, keywords = eval(language), newspaper = newspaper, frequency = 'M', topic = 'Borders')
attention['country'] = language
attn = pd.concat([attn, attention])
except TypeError:
pass
attn=attn.reset_index(drop=True)
lang_to_country = {'german': 'Germany', 'polish': 'Poland', 'spanish': 'Spain', 'dutch': 'Netherlands', 'english': 'UK'}
attn['country'] = attn['country'].map(lang_to_country)
Your query did not return anything. Make sure that the country, newspaper, and dates are correct. Did you give a start date that is later than the end date?
---
The optional newspapers are: ['Fakt' 'Rzeczpospolita' 'Gazeta Wyborcza' 'Uk Times' 'Guardian' 'Uk Sun'
'Nrc' 'De Telegraaf' 'Volkskrant' 'El Mundo' 'El Pais' 'Abc Spain'
'Suddeutsche Zeitung' 'De Welt' 'Bild']
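The loop above looks up each keyword list with eval(language). A plain dictionary of lists does the same job without eval, which is safer and clearer; a sketch with truncated versions of the lists defined above:

```python
# keyword lists as in the cells above (truncated here for brevity)
german = ['grenz', 'schengen', 'Reisebeschränkung']
polish = ['granica', 'Schengen', 'ograniczenie']

# a plain dict lookup replaces eval(language)
keyword_lists = {'german': german, 'polish': polish}
keywords = keyword_lists['german']
```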
x = attn.groupby(['country','date']).mean().reset_index()
fig = px.line(x, x="date", y="attention", color='country',markers=True, title='Attention to Borders in European Newspapers', height=500)
fig.update_layout(
font_family="Courier New",
font_color="black",
title_font_family="Courier New",
title_font_color="black",
font_size = 18,
legend_title_font_color="black",
template='seaborn',
)
fig.show(include_plotlyjs=True)
#sns.relplot(x='date', y='attention', hue='country', data=attn, kind='line', aspect=2)
#plt.title('Attention to Borders in European Newspapers')
FutureWarning:
The default value of numeric_only in DataFrameGroupBy.mean is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
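This FutureWarning comes from calling .mean() on a grouped frame that still contains non-numeric columns. Passing numeric_only explicitly silences it; a sketch with toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['Germany', 'Germany'],
    'date': ['2020-01', '2020-01'],
    'attention': [0.2, 0.4],
    'newspaper': ['Bild', 'De Welt'],  # non-numeric column that triggers the warning
})
# numeric_only=True averages the numeric columns and drops the rest
x = toy.groupby(['country', 'date']).mean(numeric_only=True).reset_index()
```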
### Let's get the sentiment of the headlines that contain the keywords over time
languages = ['german', 'polish', 'spanish', 'dutch', 'english']
attn1= pd.DataFrame()
for newspaper in df.newspaper.unique():
try:
lang = df.loc[df.newspaper == newspaper].language.unique()[0]
attention = get_titles_by_keyword(df, keywords = eval(lang), newspaper = newspaper)
attention['newspaper'] = newspaper
attn1 = pd.concat([attn1, attention])
except TypeError:
        pass
Your query did not return anything. Make sure that the country, newspaper, and dates are correct. Did you give a start date that is later than the end date?
(Plot: Sentiment of Headlines about Borders in European Newspapers)
x = attn2.groupby(['country', 'date']).mean().reset_index()
x['Sentiment'] = x.sentiment_centered
x['Country'] = x.country
fig = px.line(x, x='date',y='Sentiment',color='Country',markers=True, title='Dynamic Sentiment of Headlines about Borders in European Newspapers', height=500,
color_discrete_sequence=px.colors.qualitative.T10)
fig.update_layout(
font_family="Courier New",
font_color="black",
title_font_family="Courier New",
title_font_color="black",
font_size = 18,
legend_title_font_color="black",
template='plotly_white',
showlegend=True,
xaxis_title = 'Date',
width=1200)
fig.update_traces(
marker=dict(
size=10),
line=dict(
width=3
)
)
fig.show(include_plotlyjs=True)
#sns.relplot(x='date', y='sentiment_centered', hue='country', data=attn2, kind='line', aspect=2)
#plt.title('Dynamic Sentiment of Headlines about Borders in European Newspapers')
Classifying the sentiment of the headlines can take a long time. For example, the ~5k German vaccine headlines took about 5 minutes. This poses a big challenge if we want to classify many more headlines.
One alternative is to use a GPU for the matrix multiplications that occur billions of times when using transformers models. CPUs are not very good at this task, so using a GPU can be MUCH faster. Luckily, Google Colab allows free use of GPUs. Google Colab is probably the best way to run this entire notebook anyway, because Python seems to be challenging for a lot of people to install locally, so even if you don’t want to use a GPU, Google Colab might be the best option.
Google has a tutorial on how to use a GPU in Colab here. Once you enable the GPU, the only change you’ll need to make to the language model is to add device=0 when defining the classifier. This would look like the following:
sentiment_classifier = TextClassificationPipeline(tokenizer=tokenizer, model=m1, device=0) ## add in device=0 to use the GPU
instead of this:
sentiment_classifier = TextClassificationPipeline(tokenizer=tokenizer, model=m1) ## the original version, which runs on the CPU